than three of you. I have given this talk in a room of three people. So hopefully... Did they understand it? No, they had a lot of trouble. But we got to work back and forth and figure it out. Anyway, so you're here to learn Elasticsearch. I've got 50 minutes, I believe, so we'll see if I can fit an hour-long talk within 50 minutes. Hi, my name is John Berryman. Just to do a quick introduction to myself: find me on Twitter; a lot of my self-worth is derived from my Twitter followers. Growing up, I was a pretty nerdy kid that started reading programming manuals when I was in like the first grade. I ended up getting into aerospace engineering. That was my first career. I decided that satellites and all that stuff were pretty cool, but I liked the programming and I liked the math. So after about four years in the field, I moved. I got into search technology and I was a consultant. I wrote a book. That guy right there. Wouldn't necessarily recommend doing that with your life, but it's a good calling card. And now I work at Eventbrite. I am a discovery engineer, so search and recommendations and stuff like that. To give you a little preview of what we're going to talk about: this is not really an advertisement for Elasticsearch, but a lot of what we're doing involves my mental model for thinking through the Eventbrite problem. So just to give you a little shared background, historically the company I work at, Eventbrite, has been a very organizer-focused startup. We allow organizers who want to put on their own events to come to our website. You can build a nice little webpage with little effort to get it all set up. You can sell tickets; we take care of all the credit card mess. You have a platform for messaging attendees, and you get metrics. So after the event is done, you get to look back and make sure that your next events are as good as or better than this event.
But after years of actually nailing down this side of the market pretty well, my company realized that, look, we've got all this inventory. We're basically white-labeled, but everyone is plastering their events on our website. If we can turn around and sell our inventory to everyone else, then organizers are happy, the customers are happy because you can find something to do over the weekend, and we're hoping to generate the so-called flywheel effect. This is exciting for me because this is where I belong. Creating the marketplace is all about building search and browsing and recommendation features for Eventbrite. Of course, this technology is based on Elasticsearch, which is what we're talking about today. Can you guys keep it secret? So I know we're supposed to talk about Elasticsearch today, but I've got to tell you, I'm actually more interested in talking about my new startup. Yep, so don't tell anybody, but I'm going to start competing directly against Eventbrite. Our guiding principles, and I'm sorry to do this to you, I know we're supposed to talk about Elasticsearch, but Elasticsearch is hard. So I'm going to focus this new startup, and you guys can join me if you'd like. It's going to be built on MySQL, because everyone knows how databases work. Databases are easy. Let's just build on a tried and true platform, and let's not overthink it. Our specialty, because I found a free data thing online, is cat-related events. We'll start with cat-related events. Good. See, we have some attendees already. Then we'll expand to other fields. All right. We have someone who will at least buy our tickets, so we have a marketplace. Excellent. So building this new website is going to be pretty easy. There's not really too much to an event. So here's our schema with MySQL. We're going to have an ID as an integer, a name, a description, a city, a start date. You can look at all that. That makes pretty good, simple sense.
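A minimal sketch of that schema in Python, using the built-in sqlite3 module as a stand-in for MySQL (the column names come from the talk; the exact types, the sample event, and the price column that shows up later in the talk are my guesses):

```python
import sqlite3

# In-memory stand-in for the MySQL events table described above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        id          INTEGER PRIMARY KEY,
        name        TEXT,
        description TEXT,
        city        TEXT,
        start_date  TEXT,   -- ISO-8601 date string
        price       REAL
    )
""")
conn.execute(
    "INSERT INTO events (name, description, city, start_date, price) "
    "VALUES (?, ?, ?, ?, ?)",
    ("Teach Your Cat to Knit", "Knitting, but with cats", "Nashville",
     "2024-06-01", 15.0),
)

# "Select star from events" -- everything the website needs comes back.
rows = conn.execute("SELECT * FROM events").fetchall()
```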
My hypothesis, which I know will play out well, is that we can build a website based on this. I'll demonstrate it. Here's our event search: select star from events. That gives us all the details we'll need back for the website. We have date range search. Obviously, you'll need that to find something this weekend. We have geo search. Not hard. Why invest in all that stuff? We can just do string matching. Finally, it's easy to search for events that you like. I want to find an event where the name equals cat. The results are nothing. Oh. So this is the interactive part. Why do you think there might not be any results for that particular MySQL query? Yeah. Okay. So that's a little problem. I could spell cat with misspellings. These are all overloading my brain. I think we can still make this work out. You guys can't spoil all my slides before I get to them. All right. The particular problem here, to the first answer, is that probably no one's going to name their event "cat". Would you like to come to see "cat"? MySQL solves this for us. We can use a like query: like percent cat percent. The results come back as Teach Your Cat to Knit, An Evening of Cat Bowling, and BYOC Cat Dance Party. We're on board. That was just a silly thing to show you that we can probably accomplish this. Let's get more serious with a more serious query. Someone's likely to be looking for a cat farming seminar. So we're going to help them. What? Not in a bad way. That might have particular meaning to you that it doesn't to most of my audiences. Not that event. So anyway, how do we search for this? If someone comes to our website and they look for cat farming seminar: select star from events where name like percent cat farming seminar percent. Yes. Well, it's in red, which is the thing I would like to match, but it doesn't match. Interactive time. What have I done wrong now? Case. That's right. MySQL is all uppity about case. So this is also not hard.
All we have to do is lowercase whatever the people type to us, and it'll still work. Cat farming seminar. So okay, great. That matches. But "seminar for farming of cats"? Not such a match. Anyone have any ideas how I can deal with this one? Cats or farming. Well, let's try and. I want to make sure. Yeah, so okay. So let's do something like this. Good idea. Good idea. And well, it's starting to itch me a little bit, because I heard that like is not as efficient a query as a pure match. But surely not, right? And we're doing it three times. So it's kind of like scanning every document in the database three times, right? But we'll probably shard it and that scale will be fine, I'm sure. So anyway, we do indeed match that seminar for farming of cats. But we don't yet match "making a cat farm: the seminar". And now you're totally in my head because I didn't realize that this was a potentially derogatory thing. Making a cat farm: the seminar. So why does that one not match? Farming versus farm. Well, they're the same thing, right? Yeah. So with search technologies, they do a pretty good job of understanding language, and I guess we'll have to cut off the ends of the words. So farming becomes farm; at least that'll match farmer, farms, and other stuff. And we do indeed get back the results we want. I'm trying to poke some holes in my little theory here though. This is an old presentation. Are you telling me I should retire my presentation after this time? Oh yes, you're right. Okay, so I should have updated the dates on my examples on my slide for Mr. Michael Handlin in front of me. So next one: cat farm class. Doesn't match either. It's a class. It's kind of like a little mini seminar. In order to make that work, what am I going to have to do for that one? Oh, okay, okay. It doesn't match all the terms. But at least if it matches like a couple of them, that should be good enough, right?
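The lowercase-plus-ANDed-LIKEs approach can be sketched like this (again with sqlite3 standing in for MySQL; the event names are the talk's examples):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (name TEXT)")
conn.executemany("INSERT INTO events VALUES (?)", [
    ("Cat Farming Seminar",),
    ("Seminar for Farming of Cats",),
    ("Making a Cat Farm: The Seminar",),
    ("An Evening of Cat Bowling",),
])

def search(conn, query):
    """AND together one LIKE per query term, lowercasing both sides.
    Note each LIKE is a full table scan -- this is the inefficiency
    worried about in the talk."""
    terms = query.lower().split()
    where = " AND ".join("lower(name) LIKE ?" for _ in terms)
    params = [f"%{t}%" for t in terms]
    return [row[0] for row in
            conn.execute(f"SELECT name FROM events WHERE {where}", params)]

hits = search(conn, "cat farming seminar")
# "Making a Cat Farm: The Seminar" is missed: "farming" is not a
# substring of "farm" -- which is why the talk reaches for stemming.
```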
So I replaced my ands with ors, per someone's suggestion earlier. And what happens? I do indeed match everything I want. And I match all these things I don't want. And since there's no notion of which match is better than another match, the stuff we don't want can land at the top just as easily as the stuff we do. So guys, I think we're sunk. I apologize for taking you through this startup with me. But databases are very good at some things, and search engines and search technology are very good at a different set of things. In particular, search engines are quite good at finding documents that not only match exactly what you have, but contain specific tokens, phrases of tokens, and different mutations of the tokens. They understand English in a way that I think you'll understand when you leave here. Scoring and sorting of documents: MySQL finds the set that matches, whereas with Elasticsearch, as we'll see in a little bit, you can put into it an understanding of how good or bad a match is for particular search terms. And finally, this is something that both MySQL and Elasticsearch are good at, but it's become an interesting, more recent use case with search technologies: search engines are actually really good for filtering, grouping, and aggregating data. So search engines came out of the information retrieval field, but they're being used more and more for log analytics and stuff like that. And we'll touch on that right at the end. Alright, so now since we've failed, let's go ahead and get back to the main talk that you guys came here for. We're going to teach you about Elasticsearch, and in the next 30 minutes, we'll do a really quick and dirty application. I'll show you how to pull down Elasticsearch, create an index, index stuff, and retrieve it. We'll take a peek under the hood so that you can see the data structures and algorithms in place.
Fortunately, the data model for Elasticsearch is simple enough that you can leave with a basic understanding of it. And as I promised, we'll get into some of the data aggregation stuff that Elasticsearch has been used for more recently, and then we'll hopefully have a little time for questions. What I want you guys to get out of this in particular is a couple of meta goals. One, I want you to see me using a very basic implementation of Elasticsearch, and I want it to be approachable for you, so it's a tool on your shelf that you can grab and learn more about when you need it. The second thing, and I encourage you to do this with any data store technology that you want to use, is I want to impart an intuition about how these data structures work, what they're good at, and a little bit about what they're not good at. This means that when you reach for the shelf to get your tool, you actually get the right tool. So building a basic search app is not that hard. There's a lot of tuning that comes with Elasticsearch, getting the behavior and the notion of relevance just right, but getting the thing out of the box and turning it on will actually get you about 50% of the way there. So it's a real quick technology to get up and running with and get some good results. Installing and running Elasticsearch is pretty easy. You all probably know what wget is. So find your favorite mirror and pull down Elasticsearch. In this case, I do need to update my notes here; it's a slightly older version of Elasticsearch. But pull it down, unzip it to wherever you want it to live, cd into that directory, and then start the binary, bin/elasticsearch. Once you do that, you can just curl localhost at the Elasticsearch port, 9200, and it tells you, hey: "You Know, for Search." Like, in case you forgot that it was for search. But Elasticsearch is now up and running.
And just like with MySQL, with Elasticsearch you will want to think in advance about the type of data that you're going to be interacting with and build a schema for it, or, as they say in Elasticsearch, a mapping. Now, Elasticsearch is interesting here, because early on they advertised that they were a schema-less data store, in the age when MongoDB was rocketing off and everyone was kind of tacking onto this. And it was true to an extent: you could just start dumping information into Elasticsearch. That gained Elasticsearch a lot of popularity, but it's still kind of an anti-pattern. In my opinion, after years of using this technology, it's still very important to think through what you're getting ready to do with this thing. So setting up the mapping is simple. Everything in Elasticsearch is a JSON interface. And since this is a Python conference, in every example that you'll see here I am using the Python client. It's really nice; it's a fairly thin layer over the JSON interface of Elasticsearch. So when you're setting up a schema, all you have to do is specify the fields that you're going to have, in this case: ID, name, description, city, start date, price. And you get all of the things that you would typically think of existing in a data store. So you have numbers, integers, floats, strings, dates. And you can start to get more complex things: you can get locations that are a little bit more aware than just two numbers; it knows what a location is. But one thing I'll be focusing on is that not only can you have strings, you can say that your strings are special in some way. For example, an ID is a type of string, but it is a string that is not analyzed. That means that we're not going to do any special massaging and trying to understand this string as natural language. However, both the name and the description here, I've marked as having an analyzer that is English.
So this is me giving Elasticsearch a hint that not only is this blob of bytes actually text, it's English text. And I'll show you what that means to Elasticsearch in a little bit. But it's interesting, because you don't have to put English here. You can put Chinese or Japanese or most any language that you'd want. And you can make up your own stuff. There are extra rules you can put in; for example, if you have camel-case strings because you're indexing programming languages, you can break those up and make your own analysis chain for it. And then of course, here's me using the client. You create the index with that mapping structure. Okay, so we have an index set up, ready to receive events. Actually adding the events at that point is pretty simple. You have an array of events, and it's just JSON blobs again. The client is nice because you can use datetimes and it does the right thing. And then the simplest version is just an iterator: for every doc that you have, dump it into Elasticsearch. This does make an HTTP request for every doc, so there are batch methods for once you actually want to put this into production. But that's an easy way to get up and running. Okay, so now we've got a bunch of documents in the index. The next bit is to pull stuff out of it. And the easiest way to explain this, oh yeah, sorry for the microscopic text; how horrible is that for the people in the back? I'll just speak louder. So the simplest building block for pulling stuff back is this match all query. And it does exactly what you think. It's effectively the select star from the events table. It gets everything back in the order that you indexed it in. And you don't have to understand what is on the screen here; I'll provide these notes on my Twitter account later. But it gives you back what you'd expect. It tells you how much time the query took. It tells you if there were any errors.
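The mapping and indexing steps just described might look like this sketch. It assumes the official elasticsearch-py client and uses the modern typeless mapping syntax (`text`/`keyword` rather than the older `string`/`not_analyzed` on the slide); the field names come from the talk, the sample doc is invented, and the network calls are wrapped in a function rather than executed:

```python
# Mapping ("schema") for the events index. "keyword" is the modern
# spelling of a not-analyzed string; "text" with an analyzer is an
# analyzed one.
mapping = {
    "mappings": {
        "properties": {
            "id":          {"type": "keyword"},
            "name":        {"type": "text", "analyzer": "english"},
            "description": {"type": "text", "analyzer": "english"},
            "city":        {"type": "keyword"},
            "start_date":  {"type": "date"},
            "price":       {"type": "float"},
        }
    }
}

docs = [
    {"id": "1", "name": "Teach Your Cat to Knit", "city": "Nashville",
     "description": "Knitting, but with cats",
     "start_date": "2024-06-01", "price": 15.0},
]

def index_events(es, index_name, docs):
    """Create the index with the mapping, then index each doc.
    One HTTP request per doc -- use the bulk helpers in production."""
    es.indices.create(index=index_name, body=mapping)
    for doc in docs:
        es.index(index=index_name, body=doc)
```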
And obviously, importantly, it gives you all the hits back: all the documents that match the query, sorted by how well they match the query. In the case of match all, there's no notion of relevance, so you just get them back in the order that you indexed them. Alright, so that was the hello world of making a query. But there are a lot of different things you can do to craft the notion of relevance. What is an important document? What should match? What should not? And the smallest building block for these is the so-called term query. So if we have an indexed document, an event in Nashville, and I wanted to make a filter over all the documents and only hit documents corresponding to the city Nashville, then that's a term query. I say this is a term, the field is city, the token is Nashville. The special thing about a term query is, just like earlier where I said not analyzed, term means that this is just a token. It has to be exactly Nashville, capital N and all. It doesn't do anything special. And so that's a match. But where it gets interesting, and where you really get a benefit from a search engine, is when you start incorporating this notion of: hey, this is not just a string, this is actually English text. And so if we have a sort of stupid document here, name equals Filbert Sorting for Fun and Profit, then a query that is not of type term but of type match actually applies that special knowledge that this is English. And so rather than looking for the exact tokens, it knows that it can be lowercased, we can split on spaces, and sorting and sort should carry basically the same information, and so that's a match. So compared to how you'd have to do that in MySQL, you would have to make a horrendous query to make that one simple match right there, and it would also perform very poorly, for reasons that I'll get into in a little bit.
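The term and match queries from this section, written as the request bodies you'd hand to the search endpoint (a sketch; the Python client takes these as plain dicts):

```python
# Exact-token filter: only documents whose city field holds the exact
# token "Nashville" -- no lowercasing, no splitting, nothing special.
term_query = {"query": {"term": {"city": "Nashville"}}}

# Analyzed match: the query text goes through the same English analysis
# as the field did at index time, so "filbert sort" can match a
# document named "Filbert Sorting for Fun and Profit".
match_query = {"query": {"match": {"name": "filbert sort"}}}
```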
Getting more complicated, because your application has to mix a lot of different ideas together, you can do phrase matching. So not only do we have the notion of matching documents that have these terms, we want a document that has the terms sorting and filbert in it, in that order. This is not a match, because the original document had filbert sorting. However, if we search for "filbert sort", that is a match, despite the fact that it's different from the original document. The original document has uppercase and different parts of speech. But think about it as a user looking for something: you don't quite remember the name of the movie, but you're probably going to type something like this. So getting these kinds of fuzzy matches is a specialty of search technology. "Filbert fun" won't match, because there are words between filbert and fun; that's just more example of how match phrase works. But you can add this notion of slop, and everyone chuckles when I do that one. That's what it's called. You can add slop and it'll find any document that has these two words within two positions of each other. You can go nuts with this. I once had a gig with the US patent office, on the search technology that they were getting rid of as they moved to Elasticsearch's sister project, Solr. They really wanted: I want to find this word within the same sentence as some other word, and I want to find it before, or within some number of words. So you can take this same behavior and overload it and get some really complex search behavior. But everything I've shown you to this point is atomic. It's like, I want this thing or that thing. You have to have a way of gluing these things together. In Elasticsearch that is a Boolean query. In normal notions of Boolean you think ands and ors and nots. Elasticsearch has that, but using different terminology. Rather than and we say must, rather than or we say should, and then not is must not.
So that one makes pretty good sense. And if you play around with a few queries, you see why they moved to this terminology. Usually you have an array of things that must match. So in your Elasticsearch query you have a must key, and you stick all these sub-clauses that must match there. And additionally you have several things that don't have to match but should match: if you can find documents that also happen to have these other things, they should boost a little bit higher. So that's yet another array of things that, if they match, give you a better score. For each of these pieces, you also have the ability to adjust weights. So we're starting to get into a notion of how search understands what's important to your customers and to your business. You can not only match documents that match the queries, you can also boost documents that we need to sell quickly, because of expiring inventory or something like that. And that leads us to our next big topic: search relevance. I'm curious how many people here have heard of the notion of TF-IDF? Okay, only this half of the room. That's interesting. You guys should have mixed in a little bit more. It's not a hard concept, and I think it's intimidating at first, but I can break it down pretty easily. This will be a little bit of a math-y slide, but not too bad. First off, TF really just means term frequency, and I'll get into that. And IDF means inverse document frequency. And rather than giving you the Webster's definition, the best way of explaining this is through an example. Let's say a user comes to your website and makes a search for "the diddle". Now that seems odd, until you realize that one of the matching documents in your index is Hey Diddle Diddle, the Cat and the Fiddle. That's actually a pretty good match for it.
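Putting must, should, and must not together, a bool query might look like this sketch (the field names, clause contents, and boost value are made up for illustration):

```python
bool_query = {
    "query": {
        "bool": {
            # All of these have to match (like AND).
            "must": [
                {"match": {"name": "cat"}},
            ],
            # None of these may match (like NOT).
            "must_not": [
                {"term": {"city": "BFE"}},
            ],
            # These don't have to match, but documents that do match
            # score higher (like OR). "boost" weights one clause
            # relative to another.
            "should": [
                {"match": {"description": {"query": "farming",
                                           "boost": 2.0}}},
                {"match": {"description": "seminar"}},
            ],
        }
    }
}
```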
So let's do a little practice round and see what this document would be scored as from the search engine's perspective. Term frequency is simply the number of times a term occurs in a document. So the TF of "the" in this case is 2; "the" occurs twice. Similarly, just by coincidence, "diddle" also occurs twice. So TF for both of those guys is 2. So far so good? Inverse document frequency: sometimes I just wish they'd called it document frequency and put a 1 over it. Basically, document frequency is the number of times the term occurs, not in this document, but across the entire set of documents. So the document frequency for "the" is pretty high, which means the inverse document frequency for "the" is just about 0. Makes sense? And the document frequency for "diddle", not a very common word, is low; it only occurs in 7 documents. So it's actually very important, and it gets an inverse document frequency score of 1 over 7, which is a lot, lot, lot higher than 0. So when you finally figure out the total score of this document against this query, you put all those pieces together. The score is the TF-IDF score for "the" plus the TF-IDF score for "diddle". You've probably already made sense of it, but just to be a little bit redundant: TF of "the" is 2, IDF of "the" is about 0, so that term goes away. TF of "diddle" is 2 and IDF of "diddle" is 1 over 7, so that term is 2 sevenths, and you get the final result of 0.2857, blah, blah, blah. But the idea is that every document is going to go through the same process and be sorted, and so the way you craft your query informs the way this math works on your documents. You might have 10,000 matches, but you want to make sure you do the right thing so the top 10 search results are what the user wants. Okay, so that was a pretty overloading slide. I always like to take a break after heavy slides like that, and I think play is really therapeutic, and in particular I think that this, this is my favorite one. Ah, that's great. We're going to watch that one more time. I love this part of the talk.
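The arithmetic from that TF-IDF slide, worked in Python. This is the simplified form from the talk (IDF as a bare 1/df); real Lucene scoring adds normalization factors and nowadays defaults to BM25, so treat this as the intuition, not the production formula. The document frequency for "the" is an invented large number:

```python
# Query: "the diddle"
# Document: "Hey Diddle Diddle, the Cat and the Fiddle"

# tf: how often each query term occurs in this document.
tf = {"the": 2, "diddle": 2}

# df: how many documents in the whole index contain the term.
# "the" is everywhere (10,000 here is invented); "diddle" is rare.
df = {"the": 10_000, "diddle": 7}

def idf(term):
    # Simplified inverse document frequency: 1 over document frequency.
    return 1.0 / df[term]

score = sum(tf[t] * idf(t) for t in ["the", "diddle"])
# "the" contributes 2 * (1/10000), essentially 0;
# "diddle" contributes 2 * (1/7) = 0.2857...
```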
Okay, that was a good break. So, to this point, how much time have I got left, by the way? At this point we've done a lot to get you in the mind space of how search works from a mechanical perspective: how to dump stuff in, how to pull stuff out, what it can do compared to other data stores like MySQL that I was picking on. The next thing that we want to do is dive inside the data store and give you a little intuition about how the pieces inside work, and what you'll find is not that complicated. So after this section you'll have a little better understanding about when it's right to use Elasticsearch and when it's not. With any data store there are two main chunks that you have to understand: how you get data in, and how you get data out. So that's the outline for the next bit. The first step of getting data into Elasticsearch is a step called analysis. Basically we're going to take a document, in this case just one field out of a document, and I will show you how it effectively gets shredded and rearranged and shoved into the data structures that make search technology so fast. Our example in this case is the sentence "The conspirators conspire conspicuously." I chose it so that I could almost not pronounce it at a conference. Tokenization, that's the first step. In this case we have told Elasticsearch, hey, this is English, and that gives us some interesting things that we can play off of. We know that English is split on whitespace and also punctuation. We can basically throw out punctuation. An interesting side note that I always like to make here is that this is not true of a lot of languages on the other half of the earth.
So, like, my wife is Japanese, and there are places where you can have symbols right next to each other that are different words, and doing the same thing in Japanese, which you still have to do, requires a really complex algorithm to know where the best place is to split these things to make a logical sentence. So tokenization itself is a fairly deep topic. The next step is actually a fairly shallow topic: lowercasing. Pretty easy, but if someone types in lowercase you'd better make sure that it matches a document that has uppercase letters. Stop wording: a lot of the words in English are just noise words. They help us understand where things are placed relative to each other, but they don't really change the content. So we can throw away words like "the" and "is" and "was" and stuff like that. Perhaps my favorite step of analysis is stemming. This is another place where, because we've given Elasticsearch the hint that this is English, it knows some interesting tricks. If you want a document for farming to match a query for farms, which is often the case, then effectively that's what stemming accomplishes. You can take a word and, using a statistical technique, effectively chop off and sometimes modify the end of the word to make tokens that are easier to match no matter what the intent of the person searching was. Alright, the next step after analysis is indexing. So our example sentence has turned into these three tokens: conspir, conspir, conspicu. Sounds like Latin. Let's say that this is document one. The secret sauce of Elasticsearch for being so fast is effectively that during the indexing process it takes these sentences, turns them into a bunch of tokens, and then it effectively transposes that. So instead of "document one has these tokens", at the end of the analysis, when you've gone through all of your documents, you say "these tokens have these documents".
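The analysis chain just described (tokenize, lowercase, stop-word, stem) can be sketched in a few lines of Python. The stemmer here is a toy suffix-stripper invented for this example; Elasticsearch's english analyzer uses a real rule-based stemmer, so the stop-word list and suffix list are assumptions, not the actual analyzer:

```python
import re

STOPWORDS = {"the", "is", "was", "a", "an", "and", "of", "for"}

# Toy suffix list, ordered longest-first, chosen so the talk's example
# sentence stems the way the slide shows. Not the real algorithm.
SUFFIXES = ["ators", "ously", "ing", "s", "e"]

def stem(token):
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    # 1. Tokenize: split on whitespace and punctuation.
    tokens = re.findall(r"[A-Za-z]+", text)
    # 2. Lowercase.
    tokens = [t.lower() for t in tokens]
    # 3. Stop-wording: throw away noise words.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 4. Stem: chop the ends off so farming matches farms, etc.
    return [stem(t) for t in tokens]

tokens = analyze("The conspirators conspire conspicuously.")
```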
So document one had these tokens, but in the end, conspir appeared in document one as well as these two other documents. Conspicu appeared in document one as well as these three other documents. So effectively, from a Python point of view, you could implement this with a dictionary where the keys are tokens and the values are arrays of IDs. Now, under the hood this is actually implemented in Java, and they do a lot of sneaky stuff. They shim extra information into the keys: all the notions of document frequency, which we use for scoring, get shoved over into the keys when you look stuff up. And all the notion of term frequency, that's the other half of TF-IDF, is basically hidden in the values on the right, along with other information like the positions of the words in the documents, so you can do phrase matches and stuff like that. But effectively, a simple search engine is just a Python dictionary like that. Alright, so we have now gotten all the information into the index. The other half of the equation is getting information out of the index. So our inverted index looks like this, and given that data structure, what's the easiest way to find all documents that contain conspicuous and aardvarks? Anyone? Yep, that's all you have to do. These are lists, but they might as well be sets or iterators, and you find whichever IDs occur in both. And you can build arbitrarily complex things on the same idea: an or is just a set union, and a more complicated combined search is a set union followed by another set intersection. Pretty easy. But that's only half the puzzle, because MySQL is really good at finding documents that match. I just showed you how Elasticsearch finds matching documents efficiently, but Elasticsearch then has to turn around and do a sorting algorithm, which is the other important aspect of search.
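The Python-dictionary search engine just described can be glued together like so (the doc IDs and token lists are invented; real Lucene also stashes term/document frequencies and positions alongside these posting lists):

```python
from collections import defaultdict

# Analyzed documents: doc ID -> tokens.
docs = {
    1: ["conspir", "conspir", "conspicu"],
    2: ["conspir", "aardvark"],
    3: ["conspicu", "aardvark"],
}

# Transpose into the inverted index: token -> set of doc IDs.
index = defaultdict(set)
for doc_id, tokens in docs.items():
    for token in tokens:
        index[token].add(doc_id)

def search_and(*tokens):
    # AND = set intersection of the posting lists.
    return sorted(set.intersection(*(index[t] for t in tokens)))

def search_or(*tokens):
    # OR = set union of the posting lists.
    return sorted(set.union(*(index[t] for t in tokens)))

hits = search_and("conspicu", "aardvark")  # docs containing both
```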
When Google gives you back the 60,000 results it supposedly says you have for your query, you only see the top 10, and they're usually pretty good. If you scrolled down 50,000 pages, they would probably be less good. So it's important to know how that works. Effectively, what happens is that when your user gives you a query, you have an iterator over all the documents that match. And what you do to find the top 10 is you initialize a priority queue. Do you all know roughly what a priority queue is? We can talk about that. But effectively, for every document that comes through, you take it off of that iterator, you look at all the other secret stuff we've hidden in there and find the score for that document, and you put the document and that score on your priority queue, and something just iterates, doing that with every single match that exists. The interesting aspect of this priority queue, though, is that it doesn't keep up with every document it ever sees. It's only of length 10, or whatever you tell it to be. So once you're past the first 10 documents, when you've got one that scores lower than the current top 10, it doesn't even compare itself against all 10; it compares against a few of them, log n or so, says "I'm lower than all of these, never think of me again", and it's gone. So the whole thing is actually pretty efficient. Now there's a little side note here, another intuition that might be important for Elasticsearch. If you're doing some sort of relevance but you also want to return 100% of the documents, think about how you'd implement that. Deep paging is what this is called. If you've got a robot scanning your website for the 10,000th to 10,010th most fun event, then this means that you have to have a priority queue that is 10,010 long, and you sort all the documents in, throw away the first 10,000 of them, and give that chunk back.
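The top-10-via-priority-queue idea can be sketched with Python's heapq (the scores and doc IDs are made up; heapq is a min-heap, so the worst of the current top k sits at the root and gets evicted first):

```python
import heapq

def top_k(scored_docs, k=10):
    """scored_docs: iterator of (score, doc_id) pairs.
    Keeps only k items in memory no matter how many documents
    stream through, and returns them best-first."""
    heap = []  # min-heap: heap[0] is the worst of the current top k
    for score, doc_id in scored_docs:
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            # Better than the current worst: evict it, keep this one.
            heapq.heapreplace(heap, (score, doc_id))
        # else: lower than everything already kept --
        # "never think of me again."
    return sorted(heap, reverse=True)

matches = [(0.1, "a"), (0.9, "b"), (0.5, "c"), (0.7, "d"), (0.2, "e")]
best = top_k(matches, k=3)
```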
And guess what happens when the robot carelessly goes to the next page: it just gets worse and worse and worse. So that's one important intuition to have about search technology. Elasticsearch allows you to turn that off if you don't care about relevance. But if you do, I would recommend not letting anyone get past about 500 results. Alright, and then, as was said, it returns the highest-priority contents from that queue. That's effectively what we do. After the top 10, they go away; the data structure is only 10 items long, so it can't hold any more than that. Oh yeah, yeah, yeah. That's not a bad idea. I don't know how I would implement that in Elasticsearch. I don't think they make that easy for you. But yeah, that totally checks out. Alright. Okay, so I need a little transition slide here. Effectively, that gets us through everything that a search engine was until about three years ago. Elasticsearch came out of information retrieval, library-technology-type stuff: finding whatever I wanted to find. But Elasticsearch has started to prove the point really strongly that the same data structures that serve search results are actually really good for online analysis, log parsing, stuff like that. And a big chunk of that is its ability to do aggregations. And I think I can convince you that it's basically what we were doing before, plus one extra step, and you get this nice ability to do aggregations almost for free. So just like before, whenever we're aggregating, say we want to find the histogram of the ticket prices or something like that. We have all the results that we had before. We do the sorting like we did before. But while we still have that document in hand, we push it through an aggregator. It's basically just a little in-memory thing that says, okay, how many documents have I seen from $10 to $20? And it just increments those counters. For every document, it does this.
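That counter-incrementing aggregator can be sketched like this (a price histogram with $10 buckets; the documents and prices are invented):

```python
from collections import Counter

def price_histogram(matching_docs, interval=10):
    """As each matching document streams by during the search, drop its
    price into a bucket and bump the counter -- which is why
    aggregation is nearly free once you're already touching every
    match."""
    buckets = Counter()
    for doc in matching_docs:
        bucket = (doc["price"] // interval) * interval
        buckets[bucket] += 1
    return dict(buckets)

docs = [{"price": 5}, {"price": 12}, {"price": 18}, {"price": 25}]
hist = price_histogram(docs)
```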
And at the end of it, you pass back this aggregator thing, and you have these really nice results. It was just something you did almost as a byproduct of the actual search itself. So with the building blocks I've given you, you can see how we have the ability to easily filter, which is just what a search is. You can group stuff, because as the documents come through, you can already figure out which group each belongs to. And within each group, you can do calculations: running averages or anything like that. To give you a little more intuition about how you might use aggregations, here is how I encountered them for the first time. Let's say you go to Amazon. You're chuckling. That top book, by the way, is a really excellent book. Anyway, if you go to e-commerce sites, you see a lot of the original use for aggregations. They were called facets: faceted search. You have a list of subcategories on the side, with counts for how many things are in each category. You can click on one, and it serves as a filter. It gives you a bit of what I call relevance feedback, so you can understand what's actually happening. But people have taken the same data structure, turned it on its side, and you've got really nice histograms. At Eventbrite, we're making them prettier now, but you can use them to feed back good information about how many tickets were sold in a particular class. You can take exactly the same information with a different data set and give spark charts for how many tickets were sold on a particular day. You can take, again, counts over buckets, plot them on a map, and you've got a really nice geo information console to give you intuition about where things are happening in geospatial relationship to each other. And finally, I don't know exactly how to make a picture for it, but log analytics in particular are great with Elasticsearch.
Building aggregations in Elasticsearch is easy. I'm going to fly through this so I have time for a couple of questions. Effectively, all you have to do is keep making your query like normal, but add a new section to your Elasticsearch query called aggs. In this particular case it's going to be hard to read, so I'll blur over it. But you can say things like: for my aggregations, I want counts grouped by city. That's a terms aggregation where the field is city. And I also want a histogram aggregation for the prices with an interval of 10. That's the second thing. The results come back, and you have the normal search results at the very top, but there's a new section that has these aggregations in it. In this case, I've got the city buckets right there with my Nashville and Dallas and BFE events, and I've got my price buckets for the distribution of events that occurred. But here's a neat thing you can do. Right now (I really needed a graphic for this) I've got two separate aggregations. A neat thing to provide back to our users is not only the histogram of all the events, but a histogram per city. And you can do this with Elasticsearch: aggregations can be arbitrarily nested. There are performance issues after some point. But I can say: at the top level, do a terms aggregation, so we bucket everything by city and I get the counts back. And then within that aggregation, do a histogram, so that we can show our users the price distribution within the city they're interested in. The results come back in a very similar structure, except appropriately nested, so that for each city bucket you have the count, and within that you have sub-buckets for the histogram, so that you can draw it on the screen. That's effectively it.
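A sketch of that nested request body, written as a Python dict. The index and field names ("city", "price", and the "cats" match) are illustrative assumptions, not Eventbrite's real schema:

```python
# Nested aggregation sketch: a terms aggregation by city, with a price
# histogram computed inside each city bucket. Field names are made up.
query = {
    "query": {"match": {"description": "cats"}},
    "aggs": {
        "by_city": {
            "terms": {"field": "city"},
            "aggs": {
                # Nested: a separate price histogram *per city bucket*.
                "price_histogram": {
                    "histogram": {"field": "price", "interval": 10}
                }
            },
        }
    },
}
```

With a client such as the official Python one, this body would be sent as something like `es.search(index="events", body=query)`, though the exact call signature varies by client version. The response then carries the normal hits plus an `aggregations` section with a bucket per city, each containing its own histogram sub-buckets.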
I've been doing this a while, so I still have a lot to learn, but also a lot of other things I would enjoy talking about. If you're interested in learning more on your own, I know of some reading material. And, you know, find me on Twitter; tell me what I did right and what I did wrong. Anyway, that's it. What have you got? Any questions? So, repeating questions, I guess, right? The question was around how we deal with unknown terms, different languages, jargon terms, stuff like that; we can specify English or not, but what about the rest? The easy answer is you still just say it's English if it's basically English. You still get the ability to split on whitespace and all that, because that's presumably what the text looks like. I'll go to the other extreme in a second. And you still do stemming, which means that even if it's a verb the engine hasn't seen before, stemming actually does pretty well for English-like things. But if you're willing to put the work in, you have an arbitrary amount of control over what you can do. At the other extreme end of things, I guess you could write your own Java; it's all pluggable, it's just Lucene, Java Lucene. You could write your own classes to do whatever custom logic you want. If you don't want to go quite that far, there are middle-ground things like synonyms. As a preprocessing step, before you do the stemming and chop off and throw away the ends of words, you can say: here's a file of every jargon word you might see. And you can either say, don't touch it in the downstream stuff, or you can say, this maps to three other words, or these three words map to one word. So there's a lot of flexibility in what you can do to tune that relevance notion. But it might be a lot of work. He had a question first. You... yes and no.
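A hedged sketch of how that middle-ground synonym setup might be wired into Elasticsearch index settings. The analyzer and filter names are made up, and the synonym list is purely illustrative; the point is that the synonym filter runs before the stemmer in the token filter chain:

```python
# Illustrative index settings: a custom analyzer where jargon synonyms
# are applied as a preprocessing step before stemming. Names are made up.
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "jargon_synonyms": {
                    "type": "synonym",
                    "synonyms": [
                        "meetup, gathering, event",  # all treated alike
                        "kitty => cat",              # one-way mapping
                    ],
                }
            },
            "analyzer": {
                "english_with_jargon": {
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "jargon_synonyms",  # runs before stemming
                        "porter_stem",      # then chop off word endings
                    ],
                }
            },
        }
    }
}
```

The ordering in the `filter` list is the whole trick: because `jargon_synonyms` sits ahead of `porter_stem`, jargon terms can be expanded or normalized before the stemmer throws away their endings.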
Part of that is that not only do we hide the term frequency, the count of how often each of those terms occurred in the documents, but we also hide a few other small things that we stick next to the tokens. We hide each term's position in the document, which gets to your question about phrases. And you can hide a couple of other things that aren't used as often: you can hide part of speech there if you have that set up, and you can hide a payload, which you can use however you want, like boosting documents that contain certain words a little bit higher. One thing you can't do, though, is take this data structure and reassemble the original document from it. That's why, whenever you store a document in Elasticsearch, it gets shredded and turned into that inverted index, and at the same time there's a different file on disk, pulled into memory, from which the original document can be read back out. So you're effectively storing everything twice. A document at Eventbrite is an event. It has what I call the boring fields that you'd expect: the name, description, the date, geolocation (which actually gets interesting). But we also have, and this is in progress, interesting machine-learned fields like an event cluster that we can later match up with a user cluster that comes in, or event quality, which is another thing we're inferring from the metadata around the event. Those are all things that Elasticsearch is happy to deal with. Beyond that, there's not too much that's a mind-blowing departure from what I showed here. The event is just a JSON record; we hand it to Elasticsearch, it inverts it into that data structure, and it stores the original as well. Elasticsearch stores both. [The rest of this exchange was inaudible.]
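As a hedged illustration of what such an event document might look like as the JSON record handed to Elasticsearch; every field name and value here is made up, not Eventbrite's actual schema:

```python
# Illustrative only: an event as a JSON-style record. Elasticsearch would
# both invert this into the index and store the original source, i.e.
# store it twice, as described above.
event = {
    "id": 42,
    # The "boring" expected fields:
    "name": "Nashville Cat Lovers Meetup",
    "description": "An evening of cat-related festivities.",
    "city": "Nashville",
    "start_date": "2016-06-01T19:00:00",
    "location": {"lat": 36.16, "lon": -86.78},  # geolocation
    # In-progress, machine-learned fields mentioned in the talk:
    "event_cluster": 7,     # matched later against a user cluster
    "event_quality": 0.83,  # inferred from surrounding metadata
}
```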
[Inaudible.] How complex is it to reindex changes? Really cool question. A really nice property of Elasticsearch is that the index is append-only: segment files on disk are effectively never touched again once written. The caveat is that when you actually change a field, what happens is you go back to where the record was written, read out the entire document, change that one field, and write the whole thing to a new segment file. The only change made to the old file is that one bit gets flipped: the old copy is marked dead, tombstoned. So updates are not great, but it's a trade-off; you get benefits from treating segments that way. It's definitely not a table scan; it's still pretty quick. Cool, so I have exactly zero minutes left. Please come back, talk to me later. And thank you very much for coming.